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Abstract 

This paper presents a set of techniques for providing 
fine-grained failover of long-running connections across 
a distributed collection of replica servers, and is espe- 
cially useful for fault-tolerant and load-balanced delivery 
of streaming media and telephony sessions. Our system 
achieves connection-level failover across both local- and 
wide-area server replication, without requiring a front- 
end transport- or application-layer switch. Our approach 
uses recently proposed end-to-end "connection migra- 
tion" mechanisms for transport protocols such as TCP, 
combined with a soft-state session synchronization pro- 
tocol between replica servers. 

The end result is a robust, fast, and fine-grained connec- 
tion failover mechanism that is transparent to client ap- 
plications, and largely transparent to the server applica- 
tions. We describe the details of our design and Linux 
implementation, as well as experimental data that sug- 
gests this approach is an attractive way to engineer ro- 
bust systems for distributing long-running streams; con- 
nections suffer relatively small performance degradation 
even when migration occurs every few seconds, and the 
associated server overhead is small. 

1 Introduction 

Ensuring a high degree of reliability and availability to an 
increasingly large client population is a significant chal- 
lenge facing Internet content providers and server oper- 
ators today. It is widely recognized that the computers 
serving Web content and streaming media on the Inter- 
net today do not possess the same impressive degree of 
reliability as other mission-critical services such as gate- 
ways and switches in the telephone network. 

An effective way to engineer a reliable system out of 
unreliable components is to use redundancy in some 
form, and server replication is the way in which reli- 
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able and available services are provided on the Web to- 
day. Because most Web connections last for relatively 
short periods of time, the problems of load-balancing 
client requests and recovering from unreachable replica 
servers can usually be handled at the granularity of 
a complete connection. Indeed, this is the approach 
seen in several current systems that perform server se- 
lection using a front-end transport- or application-layer 
switch [2, 6, 11, 18, 20], or using wide-area replication 
for Web content distribution [1,8]. 

While existing replication technologies provide adequate 
degrees of reliability for relatively short Web connec- 
tions, streaming media and Internet telephony display 
substantially longer transfer lengths. Providing reliable, 
robust service over long connections requires the ability 
to rapidly transition the client to a new server from an 
unresponsive, overloaded, or failed server during a con- 
nection [16]. The requirements of these emerging appli- 
cations motivate our work. 

We have designed and implemented a system that 
achieves fine-grained, connection-level failover across 
both local- and wide-area server replicas, without a front- 
end transport- or application-layer switch. Thus, there 
are no single points of failure or potential front-end bot- 
tlenecks in our architecture. We achieve this using a 
soft-state session synchronization protocol between the 
replica servers, combined with a connection resump- 
tion mechanism enabled by recently-developed end-to- 
end connection migration mechanisms for transport pro- 
tocols such as TCP [22] and SCTP [25]. The end result is 
a robust, fast, and fine-grained server failover mechanism 
that is transparent to the client application, and largely 
independent of the server applications. Applications that 
can benefit from this include long-running TCP connec- 
tions (e.g., HTTP, FTP transfers), Internet telephony, and 
streaming media, enabling them to achieve "mid-stream 
failover" functionality. 

Our architecture is end-to-end, with active participa- 
tion by the transport stack of all parties involved in 
the communication. Our design is largely application- 



independent: applications do not need to be modified to 
benefit from the fine-grained failover techniques. How- 
ever, because we allow a server to seamlessly take over a 
connection from another in the middle of a data stream, 
there needs to be some mechanism by which the servers 
synchronize application-level state between themselves. 
While this is a hard problem in general, there are many 
important common cases where it is not as difficult. For 
example, when each client request maps directly to data 
from a file, our lightweight synchronization mechanism 
performs quite well. If content is being generated dy- 
namically in a fashion that is not easily reproduced by 
another server, handoff becomes harder to accomplish 
without additional machinery. 

We discuss several issues involved in designing a con- 
nection failover mechanism in section 2. Section 3 de- 
scribes an end-to-end architecture for fine-grained server 
failover targeted at long-running transfers. Our TCP- 
based Linux implementation is described in section 4. 
Section 5 contains performance studies showing the ef- 
fectiveness of the failover mechanism and its resilience 
to imperfections in the health monitoring subsystem. We 
conclude with a synopsis of related work in section 6 and 
summarize our contributions in section 7. 

2 Components of a failover system 

This section describes three components that a complete 
fine-grained failover system should provide: 

(i) For any connection in progress, a method to deter- 
mine when (and if) to move it to another server; 

(ii) a selection process to identify a set of new server 
candidates; and 

(iii) a mechanism to move the connection and seam- 
lessly resume the data transfer from the new server. 

2.1 Health monitoring 

In general, the end-point of a connection is changed be- 
cause the current server is unresponsive; this may happen 
because the server is overloaded, has failed, or its path 
to the client has become congested. The failover sys- 
tem needs to detect this, following which a new server 
can be selected and the connection appropriately moved. 
We call the agent in the failover system that monitors the 
health and load of the servers the health monitor. 

There are several possible designs for a health monitor, 
and they can be broadly classified into centralized and 
distributed implementations. Our architecture accommo- 
dates both kinds. We believe, however, that it is better for 
the health monitor to be controlled by the servers than the 
client. It is often much harder for a client to have the req- 
uisite knowledge of the load on other servers, and putting 



it in control of making movement decisions may actually 
worsen the overall performance of the system. 

The focus of this paper is not on novel mechanisms for 
load or health monitoring; instead, we leverage related 
work that has already been done in this area [3, 12, 17]. 
For the purposes of this paper, we assume that each 
server in a support group is notified of server failure by 
an omniscient monitor at the same time. As the exper- 
imental results in section 5 demonstrate, however, our 
system allows the health monitoring component to be rel- 
atively simple — and even overly reactive — without sig- 
nificantly degrading performance. 

2.2 Server selection 

Once the system decides that a connection should be 
moved to another server, it must select a new server to 
handle the connection. One possibility is to use a content 
routing system and treat this as a new request to decide 
which server to hand it to. Another is to have the set of 
relatively unloaded servers attempt to take over the con- 
nection, and arrange for exactly one of them (ideally the 
closest one) to succeed. Our failover architecture admits 
both styles of server selection. 

2.3 Connection migration and resumption 

Once a connection has been targeted for movement and 
a new server has been selected to be the new end point, 
the client application should seamlessly continue without 
noticeable disruption or incorrect behavior. This requires 
that the application data stream resume from exactly 
where it left off at the old server. To achieve this in an 
application-independent fashion, the transport-layer state 
must be moved to the new server, and the application- 
layer state appropriately synchronized and restarted. 

There are different ways of doing this: one is a mech- 
anism integrated with the application where the clients 
and servers implement a protocol that allows the server to 
inform the client that its communicating peer will change 
to a new one. Then, the client application can terminate 
the connections to the current server and initiate them to 
the new one, and retrieve the portions of the stream start- 
ing from where the previous server left off. An alternate 
approach, which our work enables and advocates, is an 
application-independent mechanism by using a secure, 
transport-layer connection migration mechanism. The 
advantage of this method is that existing applications can 
benefit from this without modification, while new ones 
do not each need to incorporate their own mechanisms to 
achieve these results. 



3 Architecture 

Our architecture preserves the end-to-end semantics of a 
connection across moves between servers. Rather than 
inserting a Web switch or similar redirecting device, we 
associate each connection with a subset of the servers in 
the system 1 . This is the connection's support group, the 
collection of servers that are collectively responsible for 
the correct operation of the connection. Each support 
group uses a soft-state synchronization protocol to dis- 
tribute weakly consistent information about the connec- 
tion to each server in the group. This information allows 
a stream to resume from the correct offset after a move. 

When the health monitor determines that a connection 
should be moved, each of the remaining servers in the 
connection's support group becomes a candidate server 
for the orphaned client connection. Thus, the responsi- 
bility for providing back-up services to orphaned con- 
nections is shared equally among the other servers, and 
the new server is chosen from the support group in a de- 
centralized manner by the client. 

Determining the precise point of failure of a server is 
a difficult problem, but is fortunately one that does not 
need to be solved. It is sufficient to determine when 
the client believes the server failed. In the case of se- 
quenced byte streams, this can be represented by the 
last successfully received contiguous byte as conveyed in 
the transport-layer acknowledgment message. Hence the 
new server need only elicit an acknowledgement packet 
from the client in order to determine the appropriate 
point from which to resume transmission. 

The state distribution protocol periodically disseminates, 
for each connection, the mapping between the transport 
layer state and the application-level object being sent to 
the client (e.g., the TCP sequence number and an HTTP 
URL). For the remainder of this paper, we will use the 
term stream mapping to describe the task of translat- 
ing the particular transport-layer sequence information 
into application-level object references the server can 
understand. The transport-layer state is moved to the 
new server using a secure connection migration mech- 
anism [22]. Together, the techniques described above al- 
low for the correct resynchronization of both transport- 
layer and application state, transparent to the client ap- 
plication 

Figure 1 shows the basic architecture of our system for a 
simple configuration with two servers, A and B, in a sup- 
port group. When a client wishes to retrieve an object, it 
is initially directed to Server A through some server se- 
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Figure 1: Failover architecture for a support group of 
two servers. 

lection mechanism. The connection is observed by the 
stream mapper, which parses the initial object request 
and advertises the object currently being served and the 
necessary stream mapping information to the rest of the 
support group. 

At some point Server B may attempt to assume control 
of the connection. This may be caused by the receipt of 
a notification from the health monitor that Server A died, 
or may be initiated by a load-balancing policy mecha- 
nism. Server B initiates a connection migration by con- 
tacting the client in-band on the same connection, us- 
ing a secure transport-layer connection migration mech- 
anism. If it receives a transport-layer acknowledgement 
from the client, it knows that it was the candidate server 
selected by the client. Because this acknowledgment no- 
tifies the server of the next expected data byte, and be- 
cause the soft-state synchronization protocol in the sup- 
port group has already informed Server B of the content 
being served, content delivery can now be resumed at the 
correct point. Observe that at no point was the client ac- 
tively involved in the decision to migrate the connection, 
yet its transport-layer is fully aware that the migration 
took place and it even had the opportunity to pick one 
of several candidate servers 2 . All further client requests 
on the same connection are now directed to Server B, at 
least until the next migration event. 

In the rest of this section, we describe the formation and 
maintenance of support groups, the soft-state synchro- 
nization protocol, and the details of the transport-layer 
connection migration. 

3.1 Support groups 

The size and distribution of support groups clearly has 
a large impact on the performance of our scheme. As 
the number of servers in a support group increases, the 
increased load to any one server of a death in the group 
decreases. Unfortunately, the communication load also 
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increases, as each member of the group must advertise 
connection state to the others. 

Since the set of candidate servers for a client whose ini- 
tial server dies is bounded by the set of servers in the 
support group, it may be desirable to have a diversity 
of network locations represented in the support group to 
increase the chance the client has an efficient path to a 
candidate server. This depends greatly, however, on the 
effectiveness of the initial server selection mechanism, 
and the stability of that choice over the duration of the 
connection. We note that it may be desirable to limit the 
number of candidate servers that simultaneously attempt 
to contact the client in large support groups, as the im- 
plosion of migration requests may swamp the client. As 
a practical matter, the quadratic growth in communica- 
tion required between the servers in a support group will 
likely limit support group size to a reasonable number. 

Clearly, the choice of a live initial server is an impor- 
tant one, and much previous work has addressed meth- 
ods to select appropriate servers in the wide area. Good 
servers can be identified using BGP metrics [5], net- 
work maps [1], or application-specific metrics. Simi- 
larly, clients can be directed to these servers using DNS 
redirection [1], HTTP redirection, BGP advertisements, 
or anycast [9]. Since our architecture allows connections 
to be handed off to any other server at any point in the 
connection, the ramifications of a poor initial selection 
are not as severe as with current schemes. 

We construct support groups by generating a well-known 
hash of the server's IP address. By setting the number of 
hash buckets to n, servers are uniformly allocated into 
one of n support groups. This mechanism allows for the 
servers to be dynamically added and removed from the 
server pool. As each server comes on-line, it computes 
the hash of its interface address and then begins adver- 
tising its connections to a well-known multicast address 
for that hash value 3 . Every server in the support group 
becomes a candidate server for the connections from the 
new server immediately after receiving the first adver- 
tisement. Similarly, the new server begins listening to 
advertisements sent to the group's address, and becomes 
a candidate server for any connections advertised after its 
group membership 4 . 

The choice of support group membership and final server 
that handles a failed client should be engineered in 
a manner that avoids the "server implosion" problem, 
where all of the clients from a failed server converge 



3 The required multicast functionality may be provided at the IP 
layer or through an application-layer overlay. 

4 The initial request redirection mechanism must also be informed 
of the new server if it is to handle new client requests. 



on the same replacement server. We provide two mech- 
anisms to support implosion avoidance. The first is 
an engineering choice: by choosing smaller support 
group sizes (g members per support group out of n total 
servers), we can limit with high probability the expected 
number of clients that converge on a particular server to 
an 0(g/n) fraction of the clients from the dead server. 
The second is by delaying connection resumption by a 
load-dependent factor at each server, increasing the like- 
lihood the client is served by a less crowded server. We 
evaluate this technique in section 5. 

3.2 Soft-state synchronization 

The information advertised to the support group about 
each content stream can be divided into two parts. The 
first portion contains application-dependent information 
about the object being retrieved from the server, such as 
the HTTP URL of the client request. The second portion 
is the transport-layer information necessary to migrate 
the connection. In general, this includes the client's IP 
address, port number, and some protocol-specific trans- 
port layer state, such as the initial sequence number 
of the connection. The amount and type of transport- 
specific information varies from protocol to protocol. We 
are currently exploring a framework that describes the 
necessary information in a protocol-independent fashion. 

The state information for a particular connection may be 
unavailable at some servers in the support group because 
the connection has not yet been announced, or because 
all of the periodic announcements were lost. The first 
failure mode affects only a small window of new connec- 
tions, and can be masked as an initial connection estab- 
lishment failure. We assume that connections are long- 
running, so the second failure mode must be guarded 
against more carefully. If at least one server in the sup- 
port group has information about the connection, it can 
be re-established. We therefore need only bound the 
probability that no servers have fresh information about 
connection. It suffices to pick suitable advertisement fre- 
quencies and support group sizes (see, e.g., the analysis 
of a similar protocol by Raman et al. [19]), hence it is 
possible to achieve sufficient robustness in our synchro- 
nization protocol with markedly less complexity than a 
strongly-consistent mechanism. 

3.3 Transport-layer connection migration 

When attempting to resume a connection previously han- 
dled by another server in a fashion transparent to the 
client application, previous approaches have forged the 
server's IP address, making packets from the new server 
indistinguishable from the previous ones. This approach 
has two major drawbacks: 



• The new server and the failed server must be co- 
located. Since they both use the same IP address, 
packets from the client will be routed to the same 
point in the network. 

• The previous server cannot be allowed to return 
to the network at the same IP address, for oth- 
erwise there will be two hosts with the same ad- 
dress. Worse, if the initial server attempts to con- 
tinue serving the connection, confusion may ensue 
at the client's transport layer. 

Current approaches take advantage of the first require- 
ment to ameliorate the second. Since both the initial 
server and the failover server must be co-located, so- 
called layer-four switches or "Web switches" are placed 
in front of server groups. Web switches multiplex incom- 
ing requests across the servers, and rewrite addresses as 
packets pass through. This enables multiple servers to 
appear to the external network as one machine, provid- 
ing client transparency. The obvious drawback of this 
approach, however, is that all servers share fate with the 
switch, which may become a performance bottleneck. 

By using explicit connection migration, which exposes 
the change in server to the client, we remove both of 
these restrictions. (Indeed, our approach further empow- 
ers the client to take part in the selection of the new 
server.) Servers can now be replicated across the wide- 
area, and there is no requirement for a redirecting device 
on the path between client and server. 5 

Explicitly informing the client of the change in server 
provides robust behavior in the face of server resurrec- 
tion as well. If the previous server attempts any further 
unsolicited communication with the client — possibly due 
to network healing, server restart, or even a false death 
report by the health monitoring system, the client sim- 
ply rejects the communication in the same fashion as any 
other unsolicited network packet. 

Our architecture leverages the absence of the co-location 
requirement by allowing the client to select its candidate 
server of choice. This can be done in the transport layer 
by simply accepting the first migration message to arrive. 
Assuming that all servers in the support group are noti- 
fied of the current server's failure (section 2.1), the first 
request to arrive at the client is likely from the server best 
equipped to handle the message. The response time of a 
candidate server is the sum of the delay at the server and 
propagation delay of the request to the client. While not 
guaranteed to be the case due potentially unstable net- 
work conditions in the Internet, in general a candidate 



5 The stream mapper is not such a device; rather, it is a software 
module that allows a server application to participate in the soft-state 
synchronization protocol and connection resumption mechanism. 



server with good (low latency) connectivity and low load 
is likely to win out over a loaded candidate server with a 
poor route to the client. 

Clearly there are exceptions to this rule, but we believe 
that it is a reasonable starting point. We note that if a 
more sophisticated decision process is desired it can be 
implemented either at the candidate servers, since each is 
aware of not only the client to be contacted, but has a rea- 
sonable knowledge of the identities of the other members 
of the support group, or the client application, or both. 

3.3.1 Protocol requirements for migration 

Our architecture fundamentally requires the ability to 
migrate transport-level connections. While not widely 
available today, we believe that this capability is a pow- 
erful way to support both load-balancing and host mo- 
bility, and can be deployed incrementally by backward- 
compatible extensions to protocols like TCP [22] and as 
an inherent feature of new protocols like the Stream Con- 
trol Transport Protocol (SCTP) [25]. 

In addition to basic address-change negotiation, a mi- 
grateable transport protocol must provide two features 
to support our failover architecture: (i) it should be pos- 
sible for the migration to be securely initiated by a dif- 
ferent end-point from the ones used to establish the con- 
nection, and (ii) the transport mechanism must provide 
a method for extracting the sequence information of the 
last successfully received data at the receiver, so that the 
resumption can proceed correctly. Furthermore, the per- 
formance of our scheme is enhanced by the ability of 
multiple candidate servers to simultaneously attempt to 
migrate the connection, with the guarantee that at most 
one of them will succeed. We now address these issues 
in the context of TCP and SCTP. 

3.3.2 TCP migration 

While not a standard feature of TCP, extensions have 
been proposed to support connection migration, includ- 
ing the recently-proposed TCP Migrate Options [22, 23]. 
Using these options, correspondent hosts can preserve 
TCP connections across address changes by establishing 
a shared, local connection token for each connection. Ei- 
ther peer can negotiate migration to a new address by 
sending a special Migrate SYN segment containing the 
token of the previous connection from the new address. 
A migrating host does not need to know the IP addresses 
of its new attachment point(s) in advance. 

Each migration request includes a sequence number, 
and any requests with duplicate sequence numbers are 
ignored (actually, they are explicitly rejected through 
a RST segment). This provides our scheme with the 
needed at-most-once semantics — each candidate server 



sends a Migrate Request with the same sequence num- 
ber; the first packet to be received at the client "wins," 
and the rest are rejected. Furthermore, once the con- 
nection is migrated, packets from the previous address 
are similarly rejected, hence any attempt by the previ- 
ous server to service the client (perhaps due to network 
healing or a erroneous death report) are actively denied. 

The Migrate SYN as originally specified used the se- 
quence number of the last transmitted data segment. 
When packets are lost immediately preceding migration, 
retransmissions from the new address carry a sequence 
number earlier than the Migrate SYN. Unfortunately, 
several currently-deployed stateful firewalls block these 
seemingly spurious data segments, considering them to 
be a security risk. 

We remedy the situation by modifying the semantics of 
the Migrate SYN to include the sequence number of 
the last data segment successfully acknowledged by the 
client. This ensures that all data segments transmitted 
from the new address will be sequenced after the Migrate 
SYN. Further, since a host cannot reliably know what 
the last successfully acknowledged segment number is 
(since ACKs may be lost), we relax the enforcement of 
sequence number checking on Migrate S YNs. 

By allowing the Migrate SYN to fall outside of the cur- 
rent sequence space window, however, an attacker does 
not need to know the current sequence space of a con- 
nection to hijack it. Hence, in our extended model, only 
the secured variant of the Migrate Options provides pro- 
tection against hijacking by an eavesdropper. The se- 
cured form of the Migrate Options uses an Elliptic Curve 
Diffie-Hellman key exchange during the initial three- 
way handshake to negotiate a cryptographically-secure 
connection key. Secure Migrate SYNs must be cryp- 
tographically authenticated using this key, hence an at- 
tacker without knowledge of the connection key cannot 
hijack a connection regardless of the current sequence 
space. 

This relaxation allows any server with sufficient knowl- 
edge of the initial transport state (namely the initial se- 
quence number, connection token, and key) to request 
a migration. When a server with stale transport layer 
state assumes control of a connection, however, the client 
needs to flush its SACK blocks (and corresponding out- 
of-order packets) for proper operation of the TCP stack 
at the new server. 6 Otherwise, the transmission of a data 
segment that fills a gap in previously-transmitted out-of- 
order segments will cause the client to acknowledge re- 
ceipt of data that the new server has not yet sent, con- 
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founding the server's TCP (and an untold number of mid- 
dle boxes). 

In general, a host cannot deduce the difference between 
a Migrate SYN issued by the original host (from a new 
address) that simply failed to receive some number of 
ACKs and a new end point. We therefore reserve one 
bit from the Migrate SYN to flag when a Migrate Re- 
quest is coming from a host other than the one that ini- 
tiated the connection [23]. This allows TCP connections 
migrated by the same host to avoid the negative perfor- 
mance implications of discarding out-of-order segments 
while providing correct operation in the face of end host 
changeover. 

3.3.3 SCTP migration 

Recent work in IP telephony signaling has motivated the 
development of a new transport protocol by the transport 
area of the IETF (Internet Engineering Task Force). A 
proposed standard, the Stream Control Transport Proto- 
col (SCTP), provides advantages over TCP for connec- 
tions to multi-homed hosts [25]. By advertising a set of 
IP addresses during connection establishment, a multi- 
homed SCTP connection supports transmitting and re- 
ceiving data on multiple interfaces. We can leverage this 
capability to support connection migration between dif- 
ferent servers. 

Unlike the TCP Migrate options, however, all addresses 
to be used for an SCTP connection must be specified at 
connection establishment. While limiting the level of dy- 
namism in the server pool, it still supports failover be- 
tween servers that were known to the initial server at 
the time of connection establishment. A recent Inter- 
net Draft [26] attempts to address this issue by allow- 
ing for the dynamic addition and deletion of IP addresses 
from the association, although it requires the operations 
to be initiated by an end point already within the associa- 
tion. Additionally, SCTP was designed to support multi- 
homed hosts, as opposed to address changes, hence it 
does not support the at-most-once semantics required to 
allow multiple servers to simultaneously attempt to mi- 
grate the connection. We believe this is a deficiency in 
SCTP that can be addressed simultaneously while adding 
support for dynamic changes to the address associations. 

The additional complexity of SCTP requires servers 
to communicate more transport-level state information, 
but need not increase the frequency of communication. 
SCTP end-points are required to emit SACK packets 
upon receipt of duplicate data segments, hence the new 
server could send a data segment with a Transport Se- 
quence Number (TSN) that is known to be stale, eliciting 
a SACK with the current sequence state. 



3.4 Limitations 

Our architecture depends on the ability to perform the 
stream mapping function for objects being requested. 
Due to variability in header lengths, this requires access 
to the transport layer immediately below the application. 
In many cases, there is only one (non-trivial) transport 
protocol in use, such as TCP or SCTP. In some instances, 
however, transport protocols may be layered on top of 
each other, such as RTP over TCP. In this instance, both 
migration and stream mapping must be performed at the 
highest level, namely RTP. 

Independent of the transport- and network-layer issues 
handled by our architecture, particular applications may 
attempt to enforce semantics that are violated by a server 
change; clearly such applications cannot be handled in 
a transparent fashion. In particular, the object being 
served may have changed since the connection was ini- 
tially opened, resulting in indeterminate behavior if the 
application isn't aware of this. Further, some applica- 
tions make decisions at connection establishment based 
on server- or time-sensitive state, and do not normally 
continue to reevaluate these conditions. For instance, 
current applications may make authorization decisions 
based on the IP address of its original peer, or an HTTP 
cookie or client certificate that has since expired. Such 
applications can be notified of the connection migration 
by the stream mapper, however, and allowed to perform 
any required steps to resynchronize themselves before re- 
suming transmission. 

Despite these limitations, the common case of a long- 
running uni-directional download from a static file or 
consistent stream is handled by our architecture in an 
application-transparent manner. As described in the fol- 
lowing sections, our implementation allows a connection 
to be migrated at any point after the initial object re- 
quest completes. Furthermore, our scheme can survive 
cascaded failures (when the back-up server fails before 
completing the connection migration) due to the seman- 
tics of our state distribution mechanism and the robust- 
ness of the transport-layer migration functionality. 

4 Implementation 

Our current prototype implementation supports TCP ap- 
plications using the Migrate options [22, 23]. The soft- 
state distribution mechanism and stream mapping func- 
tionality are implemented as a wedge that runs on the 
server and proxies connections for the local content 
server. Our current implementation uses Apache 1.3, and 
is compatible with any Web server software that supports 
HTTP/ 1.1 range requests. It is possible to optimize per- 
formance by handing off the TCP relaying to the kernel 



after the request parameters have been determined, in a 
manner akin to MSOCKS [14]. 

4.1 The wedge 

The wedge is a TCP relay that accepts inbound connec- 
tions from clients, and forwards them to the local content 
server. It listens in on the initial portion of a connection 
to identify the object being requested by the client, exam- 
ines the returned object to determine the parameters nec- 
essary to resume the connection if necessary, and then 
hands the relaying off to a generic TCP splice to pass 
the data between the client and the server. The use of a 
user-level splice to copy data from one TCP connection 
(server-to-wedge) to another (wedge-to-client) results in 
some performance degradation compared to an in-server 
or in-kernel implementation, but permits a simple and 
clean implementation that can be used with a variety of 
back-end servers, not just one modified server. 

Each application protocol handled by the wedge requires 
a parsing module to identify the requested object and 
strip out any protocol headers on the resumed connec- 
tion. The architecture of the wedge in the context of 
HTTP is shown in figure 2. 

The wedge first passes the connection through the HTTP 
GET parser, which watches the connection for a GET re- 
quest to identify the requested object. The parser passes 
the request along to the back-end server, and hands off 
further control of the connection to the HTTP header 
parser. The parser counts the length of the HTTP headers 
to identify the offset in the data stream of the beginning 
of the object data, and forwards the headers back to the 
client. It then passes control to the generic data relay, 
which maintains control of the connection until termina- 
tion. 

Finally, a protocol-independent soft-state distributor pe- 
riodically examines the set of active connections and 
sends information about each connection (over UDP) to 
its support group. This information includes: 

• Client IP address and port number. 

• Takeover sequence number. 

• TCP Migrate fields (connection token, key, etc.). 

• Application-specific object parameters, including 
the name of the object. 

Each server maintains a table of all the connections in 
its support group(s). When a server receives notification 
that a peer has failed, it determines if it is in the sup- 
port group for any connections previously managed by 
the dead server (using the hashing mechanism described 
in section 3.1), and attempts to resume the connections. 

To do so, it migrates the client connection (section 4.2), 
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Figure 2: The wedge handling a new connection (left) and taking over an existing connection (right). 



computes the new offset into the data stream by compar- 
ing the current TCP sequence number to the connection's 
initial sequence number, and then passes the client con- 
nection to the protocol re-establishment module to con- 
tinue the connection. For HTTP, the re-establishment 
module sends an HTTP range request to the local server, 
strips out the protocol headers, and then relays the data 
to the client, seamlessly resuming the transfer. 

To avoid race conditions, connection migrations are se- 
quenced. If a server hears an announcement for a con- 
nection it believes it is currently managing, it checks the 
sequence number. If that number is greater than its own, 
it can safely infer that the connection has been migrated 
elsewhere (due to a load-balancing policy decision or an 
erroneous death report), and it may terminate its connec- 
tion and stop advertising it to others. Similarly, peers 
accept only the most recent announcement for a connec- 
tion, facilitating convergence after a migration. 

4.1.1 Soft-state information dissemination 

Soft-state information updates are sent every UP- 
DATE_SEC seconds, and expired from remote servers af- 
ter CACHE.TIMEOUT seconds. The proper values for 
these parameters depend on several factors [19]: connec- 
tion lengths, packet loss rates between servers, and con- 
nection frequency. Our default values for these are an 
update frequency of three seconds and a timeout period 
of 10 seconds, which result in acceptable update frequen- 
cies when serving typical streaming media connections, 
without undue state dissemination overhead. The opti- 
mal values for these figures would, of course, vary from 
site to site. 

4.1.2 Persistent connections 

Our wedge implementation does not currently support 
persistent connections, but could do so with only mi- 
nor modification. The wedge would need to continu- 
ally monitor the client side of the connection (instead of 
blindly handing the connection off to a splice) and watch 
for further object requests. When it receives a new re- 
quest, the wedge would begin announcing it via the soft- 
state distribution protocol. Using techniques from the 
LARD persistent connection handler [4], the splice can 



still perform fast relaying of the bulk data from the server 
to the client. An in-server implementation of the soft- 
state dissemination protocol would of course eliminate 
this need. 

4.2 Socket interface 

In order for a new server to handle a migrated connec- 
tion, the wedge must preload a socket with sufficient 
transport information. Conversely, this information must 
be extracted from the socket at the previous server and 
communicated to the new server. We have implemented 
two new system calls, setsockstateQ and getsockstate(), 
to provide this functionality. 

The getsockstate() call packages up the TCP control 
block of an existing TCP socket, while the setsockstate() 
call injects this information into an unconnected TCP 
socket. Only certain parts of the control block are rel- 
evant, however, such as the sequence space information, 
TCP options, and any user-specified options such as an 
MTU clamp. Other, unportable state such as the retrans- 
mission queue, timers, and congestion window are re- 
turned by getsockstate(), but not installed into the new 
socket by setsockstate(). Invoking the connect() system 
call will then cause the socket to attempt to migrate the 
connection (as specified by the preloaded state) to the 
new socket. 

After migration is completed (the connect() call suc- 
ceeds), the new server can compare the current sequence 
number to the initial sequence number returned by get- 
sockstate() to determine how many bytes have been sent 
on the connection since its establishment. 

4.3 Migration 

Figure 3 presents a tcpdump trace of a failover event, 
collected at the client. There are four hosts in this ex- 
ample: the client, cl, and three servers, sA, sB, and sC. 
To simulate a diverse set of realistic network conditions, 
the servers are routed over distinct DummyNet [2 1 ] pipes 
with round trip times of approximately 10ms, 40ms, and 
200ms respectively. Each pipe has a bottleneck band- 
width of 128Kb per second. Initially the client is retriev- 
ing an object from sA. After some period of time, an 



Initial Data Transmission: 

0.00000 cl.1065 > SA.8080: . ack 0505 win 31856 

^— ^ (Erroneous) sA Death Pronouncement Issued ^— ^ 

0.08014 SA.8080 > cl.1065: P 0505:1953(1118) ack 1 win 31856 

Successful Connection Migration to SB: 



0.09515 sB.1033 > cl.1065: 
1> 

0.09583 cl.1065 > sB.1033: 

0.14244 SB. 1033 > cl.1065: 



S 0:0(0) win <migrate PRELOAD 



S 0:0(0) ack 1953 win 32120 

ack 1 win 32120 



Continued Data Transmission from sA: 



0.17370 sA.8080 > cl.1065: 
0.17376 cl.1065 > sA.8080: 



P 0505:1953 (1118) ack 1 
R 1:1(0) win 



lin 31856 



Failed Connection Migration Attempt by sC: 

0.17423 sC.1199 > cl.1065: S 0:0(0) win <migrate PRELOAD 
1> 
0.17450 cl.1065 > sC.1499: R 0:0(0) ack 1 win 

Resumed Data Transmission from sB: 



0.24073 sB.1033 > cl.1065 

0.25663 cl.1065 > sB.1033 

0.33430 SB. 1033 > cl.1065 

0.42776 SB. 1033 > cl.1065 

0.42784 cl.1065 > sB.1033 



P 1953:3113(1160) ack 1 win 32120 
. ack 3413 win 31856 
P 3113:1873(1160) ack 1 win 32120 
P 1873:6333(1160) ack 1 win 32120 
. ack 6333 win 31856 



Figure 3: An annotated failover trace (collected at the 
client) depicting the migration of a connection to one 
of two candidate servers. 



erroneous death pronouncement is simultaneously deliv- 
ered to the servers by a simulated health monitor that de- 
clares sA dead. This pronouncement is received by the 
servers at approximately 0.073s (not shown). 

As figure 3 shows, each of the other servers in the sup- 
port group immediately attempts to migrate the connec- 
tion. Due to their disparate path latencies, the Migrate 
SYNs arrive at different times. With a path latency of 
only 20ms, the Migrate SYN from sB arrives first, and is 
accepted by the client. 

The next section of the trace shows the robustness of 
our scheme in the face of mistaken death pronounce- 
ments. The previous simulated announcement by the 
health monitor was in error; sA is in fact still operational. 
Furthermore, there are several outstanding unacknowl- 
edged data segments (including the segment seen in the 
trace at time 0.08014), as the client does not emit any 
further ACKs to sA once it has migrated to sB. Hence 
sA times out and retransmits the most recent data seg- 
ment. The client responds by sending a RST segment, 
informing sA it is no longer interested in receiving fur- 
ther transmissions. (The remaining retransmissions and 
corresponding RSTs are not shown for clarity.) 

Continuing on, we see that the Migrate SYN from the 
other candidate server, sC finally arrives approximately 
100ms after the death announcement. Since the Mi- 
grate Request number (1) is identical to the previously re- 
ceived Migrate SYN, the client rejects the request. Note 
that if sB had died, and this Migrate SYN was an attempt 



by sC to further resume the connection, the request num- 
ber would have been incremented. 

The final portion of the trace shows the resumed data 
transmission, continuing from the last contiguous re- 
ceived data segment (as indicated by the SYN/ ACK sent 
by the client). Since the path from client to sB is likely 
different from the path to sA, the TCP congestion state is 
reset and the connection proceeds in slow start. 

5 Performance 

We conducted several experiments to study the robust- 
ness of our scheme in the presence of overly-reactive or 
ill-behaving health monitors, the overhead incurred, and 
the consequences of many connections requiring simul- 
taneous migration. 

5.1 Server stability 

We first examined the performance degradation experi- 
enced by a connection as a function of the rate at which 
it is migrated between different servers. The lower the 
impact, the greater the resilience of our scheme to an im- 
perfect health monitor or unstable load-balancing policy. 
In particular, we would like to isolate the highest migra- 
tion frequency before performance severely degrades. 

One might assume that performance degradation would 
increase steadily as the frequency of oscillations between 
servers increases. Hence we conducted a series of sim- 
ple experiments where a client was connected to two 
servers over distinct links, each with a round-trip propa- 
gation time of 40ms and distinct bottleneck bandwidths 
of 128Kbps. The bottleneck queue size from both servers 
was 14 KBytes, substantially larger than the bandwidth- 
delay product of the path. All graphs in this section rep- 
resent data collected at the client. 

Contrary to our initial intuition, we find that the degra- 
dation is non-monotonic in the oscillation frequency for 
this experiment. This is shown in figure 4, which de- 
picts the progression of five separate downloads, each 
subjected to a different rate of oscillation. The connec- 
tion served entirely by one server performs best, but the 
other rates deviate in an unexpected manner. 

While the traces and exact numbers we present are spe- 
cific to our link parameters, they illustrate three impor- 
tant interactions. The first is intuitive: the longer the time 
between server change events, the higher the throughput, 
because there is less disruption. This explains the de- 
creasing overall trend and the decreasing magnitude of 
the "bumps" in figure 5. The second effect is due to 
the window growth during slow start; if migrations oc- 
cur before the link bandwidth is fully utilized, through- 
put decreases dramatically because the connection al- 
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Figure 4: Connection ACK traces for varying rates of 
server oscillation. 



Figure 6: Sequence traces of oscillatory migration be- 
havior. 
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Figure 5: Download times vs. oscillation rates. A con- 
nection served entirely by one server takes 61.96 sec- 
onds to complete. 

ways under-utilizes the link. This occurs at at oscilla- 
tion periods less than three seconds in the figure 5. The 
third interaction occurs when migration occurs during 
TCP loss recovery, either due to slow start or conges- 
tion avoidance. In this case, the go-back-n retransmis- 
sion policy during migration forces the connection to dis- 
card already-received data. This interaction explains the 
periodic "bumps" in figure 5 and is discussed in more 
detail below. 

To illustrate the slow-start and loss recovery interactions, 
figure 6 examines the sequence traces for the interval 
from 35 to 40 seconds of the connections subjected to 2 
and 5 second oscillations. At 2 seconds, connections are 
still ramping up their window sizes and have experienced 
no losses. At 5 seconds, the connections experience mul- 
tiple loss events as slow start begins to overrun the buffer 
at the bottleneck. Four retransmissions can be observed 
to be successfully received, the fifth unfortunately arrives 
just after the Migrate SYN from the new server. Since 
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Figure 7: The request overhead of the wedge as a 
function of request size. 

the remaining data is non-contiguous, it is flushed at the 
receiver in accordance with the go-back-n policy, and re- 
transmission resumes from substantially earlier. Similar 
loss interactions appear as the regular pauses in figure 4. 
Regardless of the exact period of interaction, the server 
switching overhead for realistic rates of failure is quite 
low. 

5.2 Overhead 

Micro-benchmarks of the request fulfillment time for an 
unloaded server are shown in figure 7. The overhead as- 
sociated with wedge processing becomes negligible as 
request size increases to the long-running, large sessions 
for which our scheme is designed. The impact ranges 
from an additional 1ms per request (80% overhead) for 1 
KByte requests to 12ms (1%) for 8 MByte requests. 

The load overhead from the wedge comes from our sim- 
ple TCP splicing implementation; we do not present its 
evaluation here. The kernel-assisted techniques men- 
tioned in section 4 or the TCP splicing techniques de- 



Method 


Avg. Latency 


Avg. Max Clients 


Distance 


34ms 


2146 


Optimal 


38ms 


1399 


Backoff 


67ms 


1334 


Round-Robin 


94ms 


1112 


Random 


94ms 


1160 



Table 1: Simulation distances and node with most 
clients for different server selection methods. 

scribed by Cohen, Rangarajan, and Slye [7] can elimi- 
nate most of the TCP processing overhead, but not the 
connection establishment overhead; an in-server imple- 
mentation can eliminate nearly all of the overhead. 

5.3 Implosion 

To explore the degree to which latency-dependent server 
selection affected server implosion, we simulated 10,000 
clients served by 10 servers. (Other numbers of clients 
and servers had nearly identical results). The clients and 
servers were laid out on a random two-dimensional grid 
to represent the distance between them; possible laten- 
cies ranged from 0ms to 250ms. 

We first tested two global methods. Optimal caps 
server occupancy at 1400 clients and performs a global 
minimum-latency assignment to servers. Round-Robin 
guarantees that clients are evenly distributed between 
(effectively random) servers. Next, we simulated three 
distributed methods. Distance uses only the distance 
metric to assign servers. Random chooses servers purely 
at random from the failure group. Backoff uses an expo- 
nential backoff based on the number of clients hosted at 
the server and the number of outstanding "offers." 

The performace of each server selection method is shown 
in table 5.3. Pure distance assignment has the best la- 
tency, but suffers from severe implosion effects. Optimal 
does nearly as well without the implosion, but has feasi- 
bility issues in a distributed environment. Round-Robin 
and random assignment both result in very even distribu- 
tions of clients, but have high latency. As expected, our 
weighted backoff method achieves a nice compromise. 

6 Related work 

Request redirection devices can perform per-connection 
load balancing, but achieve better performance and flex- 
ibility if they route requests based upon the content of 
the request. Many commercially available Web switches 
provide content-based request redirection [2, 6, 11, 18] 
by terminating the client's TCP connection at a front-end 
redirector. This redirector interprets the object request in 
a manner similar to our wedge, and then passes the re- 
quest on to the appropriate server. In this architecture, 



however, the redirector must remain on the path for the 
duration of the connection, since the connection to the 
server must be spliced together with the client's connec- 
tion. 

In-line redirection inserts a potential performance bot- 
tleneck, and the benefits of TCP splicing [7] and hand- 
ing off connections directly to end machines within a 
Web server cluster are well known. Handoff mechanisms 
were first explored by Hunt, Nahum, and Tracey [13], 
and later implemented in the LARD (Locality -Aware Re- 
quest Distribution) system [17]. LARD was recently ex- 
tended [4] to support request handoff between backend 
servers for HTTP/1.1 persistent connections [10]. A key 
component of the LARD implementation is a TCP hand- 
off mechanism, which allows the front-end load balancer 
to hand the connection off to back-end servers after the 
load-balancing decision has been made. Similar func- 
tionality can now also be found in a commercial product 
from Resonate [20]. While all three of these mechanisms 
are transparent to the client, they each require the con- 
nection to be actively handed off by the front end to the 
back-end server. 7 

The need for previous techniques to maintain hard con- 
nection state at the front end has made developing Inter- 
net telephony systems with reliability equivalent to cur- 
rent circuit-switched technologies quite difficult. A new 
Reliable Server Pooling (Rserpool) working group has 
recently been formed by the IETF to examine the needs 
of such applications. We believe our architecture ad- 
dresses many of the requirements set forth in the working 
group charter [16] and described in the Aggregate Server 
Access Protocol (ASAP) Internet Draft [24]. 

7 Conclusion 

We described the design and implementation of a fine- 
grained failover architecture using transport-layer con- 
nection migration and an application-layer soft-state syn- 
chronization mechanism. Our architecture is end-to-end 
and transparent to client applications. It requires deploy- 
ment of previously-proposed changes to only the trans- 
port protocol at the communicating peers, but leaves 
server applications largely unchanged. Because it does 
not use a front-end application- or transport-layer switch, 
it permits the wide-area distribution of each connection's 
support group. 

Experimental results of our prototype Linux implemen- 
tation show that the performance of our failover system 
is not severely affected even when connection halts and 



7 Back-end forwarding [4] allows different back ends to serve subse- 
quent requests, but requires the previous server to forward the content 
(through the front end) back to the client. 



resumptions occur every few seconds. Performance de- 
creases only marginally with increasing migration fre- 
quency, with an additional contribution dependent on the 
loss rate of the connection immediately preceding migra- 
tion. We therefore believe that this approach to failover 
is an attractive way to build robust systems for delivering 
long-running Internet streams. 

The techniques described in this paper are applicable to 
a variety of systems, in addition to the traditional end- 
to-end browser-to-server model. Although our archi- 
tecture does not require application- or transport-layer 
switches for routing and health monitoring, it does not 
preclude them. For example, some commercial prod- 
ucts (e.g., Cisco LocalDirector [6]) suggest using two co- 
located Web switches for redundancy. When used across 
long-running connections, our solution allows existing 
connections to be seamlessly migrated between multiple 
Web switches without adversely affecting performance. 

Our source code is available (under GPL) at http : // 
nms . lcs . mit . edu/ software/migrate/. 
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